Effective Implementation of DGEMM on Modern Multicore CPU

Authors

  • Pawel Gepner
  • Victor Gamayunov
  • David L. Fraser
Abstract

In this paper we will present a detailed study of tuning double-precision matrix-matrix multiplication (DGEMM) on the Intel Xeon E5-2680 CPU. We selected an optimal algorithm from the instruction-set perspective, as well as software tools optimized for Intel Advanced Vector Extensions (AVX). Our optimizations included the use of vector memory operations and AVX instructions. Our proposed algorithm achieves a performance improvement of 33% compared to the latest results achieved using the Intel Math Kernel Library DGEMM subroutine.
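The abstract's core idea (vector memory operations plus AVX arithmetic) can be illustrated with a register-blocked DGEMM inner loop. This is a minimal sketch, not the authors' actual kernel: the panel width `NB` and row-major layout are assumptions, and the unit-stride inner loop is what lets an AVX-capable compiler emit 256-bit vector loads and multiply-adds.

```c
#include <stddef.h>

/* Illustrative register-blocked DGEMM: C = C + A*B, square n x n matrices,
   row-major storage. NB is an assumed panel width; 4 doubles fill one
   256-bit AVX register, so the inner j-loop maps onto vector operations. */
enum { NB = 4 };

void dgemm_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t jj = 0; jj < n; jj += NB) {          /* column panel of C */
        size_t jmax = jj + NB < n ? jj + NB : n;
        for (size_t i = 0; i < n; i++)
            for (size_t k = 0; k < n; k++) {
                double a = A[i*n + k];               /* scalar broadcast */
                for (size_t j = jj; j < jmax; j++)
                    C[i*n + j] += a * B[k*n + j];    /* unit stride: vectorizable */
            }
    }
}
```

Compiled with `-O3 -mavx` (or `-march=native`), the innermost loop is a candidate for auto-vectorization; a production kernel would additionally block for the caches and use FMA intrinsics directly.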


Related Articles

Evaluation of DGEMM Implementation on Intel Xeon Phi Coprocessor

In this paper we will present a detailed study of implementing double-precision matrix-matrix multiplication (DGEMM) utilizing the Intel Xeon Phi Coprocessor. We discuss a DGEMM algorithm implementation running "natively" on the coprocessor, minimizing communication with the host CPU. We run DGEMM natively across a range of matrix sizes, as well as using the Intel Math Kernel Library. Our optimiza...


Accurate CPU Power Modeling for Multicore Smartphones

CPU is a major source of power consumption in smartphones. Power modeling is a key technology to understand CPU power consumption and also an important tool for power management on smartphones. However, we have found that existing CPU power models on smartphones are ill-suited for modern multicore CPUs: they can give high estimation errors (up to 34%) and high estimation accuracy variation (mor...


Is Cache Oblivious DGEMM a Viable Alternative?

We present an in-depth study of various implementations of DGEMM, using both the recursive and iterative programming styles. Recursive algorithms for DGEMM are usually cache-oblivious and they automatically block DGEMM’s operands A, B, C for the memory hierarchy. Iterative algorithms for DGEMM explicitly block A, B, C for the L1 cache, higher caches and memory. Our study shows that recursive DG...
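The recursive, cache-oblivious style described above can be sketched as a divide-and-conquer multiply: splitting each operand into quadrants blocks A, B, and C for every cache level at once, with no tuned block-size parameter. This is a minimal illustration under simplifying assumptions (n a power of two, row-major storage, an arbitrary base-case cutoff), not the paper's implementation.

```c
#include <stddef.h>

#define CUTOFF 16  /* assumed base-case size; below this, loop directly */

/* Cache-oblivious C += A*B for square n x n blocks inside a matrix with
   leading dimension ld (row-major). Each level of recursion halves the
   working set, so every cache level is blocked for automatically. */
static void rec_mm(size_t n, size_t ld,
                   const double *A, const double *B, double *C)
{
    if (n <= CUTOFF) {                      /* base case: plain triple loop */
        for (size_t i = 0; i < n; i++)
            for (size_t k = 0; k < n; k++)
                for (size_t j = 0; j < n; j++)
                    C[i*ld + j] += A[i*ld + k] * B[k*ld + j];
        return;
    }
    size_t h = n / 2;                       /* quadrant offsets */
    const double *A11 = A, *A12 = A + h, *A21 = A + h*ld, *A22 = A + h*ld + h;
    const double *B11 = B, *B12 = B + h, *B21 = B + h*ld, *B22 = B + h*ld + h;
    double *C11 = C, *C12 = C + h, *C21 = C + h*ld, *C22 = C + h*ld + h;

    rec_mm(h, ld, A11, B11, C11);  rec_mm(h, ld, A12, B21, C11);
    rec_mm(h, ld, A11, B12, C12);  rec_mm(h, ld, A12, B22, C12);
    rec_mm(h, ld, A21, B11, C21);  rec_mm(h, ld, A22, B21, C21);
    rec_mm(h, ld, A21, B12, C22);  rec_mm(h, ld, A22, B22, C22);
}
```

The iterative style the abstract contrasts with would instead pick explicit block sizes for the L1, L2, and L3 caches; the recursion trades that tuning effort for function-call overhead near the leaves, which is why a base-case cutoff matters.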


A Fast GEMM Implementation On a Cypress GPU

We present benchmark results of optimized dense matrix multiplication kernels for the Cypress GPU. We write general matrix multiply (GEMM) kernels for single (SP), double (DP) and double-double (DDP) precision. Our SGEMM and DGEMM kernels show ∼ 2 Tflop/s and ∼ 470 Gflop/s, respectively. These results for SP and DP correspond to 73% and 87% of the theoretical performance of the GPU, respectively. Cu...


Structured Orthogonal Inversion of Block p-Cyclic Matrices on Multicore with GPU

We present a block structured orthogonal factorization (BSOF) algorithm and its parallelization for computing the inversion of block p-cyclic matrices. We aim at high performance on multicores with GPU accelerators. We provide a quantitative performance model for optimal host-device load balance, and validate the model through numerical tests. Benchmarking results show that the parallel BSOF...




Journal:

Volume   Issue

Pages  -

Publication date: 2012